algorithmic regularization
Algorithmic Regularization in Tensor Optimization: Towards a Lifted Approach in Matrix Sensing
Gradient descent (GD) is crucial for generalization in machine learning models, as it induces implicit regularization, promoting compact representations. In this work, we examine the role of GD in inducing implicit regularization for tensor optimization, particularly within the context of the lifted matrix sensing framework. This framework has been recently proposed to address the non-convex matrix sensing problem by transforming spurious solutions into strict saddles when optimizing over symmetric, rank-1 tensors. We show that, with sufficiently small initialization scale, GD applied to this lifted problem results in approximate rank-1 tensors and critical points with escape directions. Our findings underscore the significance of the tensor parametrization of matrix sensing, in combination with first-order methods, in achieving global optimality in such problems.
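The implicit-regularization effect of a small initialization scale can be illustrated numerically. The sketch below runs gradient descent on a plain (unlifted) symmetric matrix sensing objective with an overparametrized factor; it does not reproduce the lifted tensor parametrization from the abstract, and all dimensions, measurement counts, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Symmetric matrix sensing: recover a rank-1 PSD matrix X* from measurements
# y_k = <A_k, X*>, parametrizing X = U U^T with a full-size (overparametrized)
# factor U and a small random initialization.

rng = np.random.default_rng(0)
n, m = 10, 80                                  # matrix size, number of measurements

Z = rng.normal(size=(n, 1)) / np.sqrt(n)
X_star = Z @ Z.T                               # ground-truth rank-1 matrix
A = rng.normal(size=(m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2             # symmetric Gaussian sensing matrices
y = np.einsum('kij,ij->k', A, X_star)

alpha = 1e-3                                   # small initialization scale (the key knob)
U = alpha * rng.normal(size=(n, n))
lr = 0.01

for _ in range(5000):
    residual = np.einsum('kij,ij->k', A, U @ U.T) - y
    grad = (2.0 / m) * np.einsum('k,kij->ij', residual, A) @ U   # grad of (1/2m)||A(UU^T)-y||^2
    U -= lr * grad

# Despite U being n x n, the learned X = U U^T has an (approximately) rank-1 spectrum:
print(np.round(np.linalg.svd(U @ U.T, compute_uv=False), 4))
```

With a large initialization scale (e.g. alpha = 1.0) the same loop typically converges to a solution whose spectrum is far from rank 1, which is the contrast the abstract refers to.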
Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions, including feed-forward fully connected and convolutional deep neural networks with linear, ReLU, or Leaky ReLU activation. We rigorously prove that gradient flow (i.e., gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes \eta_t = O(t^{-(1/2 + \delta)}) (0 < \delta \le 1/2) automatically balances the two low-rank factors and converges to a bounded global optimum.
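The balancedness claim is easy to check numerically. The sketch below trains a two-layer linear model with small-step gradient descent and tracks the gap between the layers' squared Frobenius norms; the architecture, sizes, and step size are illustrative assumptions, and discrete GD preserves the gradient-flow invariant only approximately.

```python
import numpy as np

# Two-layer linear model y_hat = W2 @ W1 @ x trained by small-step GD.
# The gap ||W1||_F^2 - ||W2||_F^2 stays nearly constant along the trajectory
# (it is exactly invariant under gradient flow), so initially small weights
# end up with balanced magnitudes across layers.

rng = np.random.default_rng(1)
d_in, d_hid, d_out, n = 5, 8, 3, 50

X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, d_in)) @ X          # targets from a random linear teacher

W1 = 0.1 * rng.normal(size=(d_hid, d_in))       # small initialization
W2 = 0.1 * rng.normal(size=(d_out, d_hid))
lr = 1e-3                                       # small step size, mimicking gradient flow

gap = lambda: np.sum(W1**2) - np.sum(W2**2)
print("initial gap:", round(gap(), 4))

for _ in range(20000):
    E = W2 @ W1 @ X - Y                         # residuals, shape (d_out, n)
    g1 = W2.T @ E @ X.T / n                     # dL/dW1 for L = (1/2n)||W2 W1 X - Y||_F^2
    g2 = E @ (W1 @ X).T / n                     # dL/dW2
    W1, W2 = W1 - lr * g1, W2 - lr * g2

print("final gap  :", round(gap(), 4))          # nearly unchanged despite large weight growth
print("layer norms:", round(np.sum(W1**2), 2), round(np.sum(W2**2), 2))
```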
Reviews: Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
Edit after author feedback, I do not wish to change my score:
- To me, balance between positive quantities is not only about their difference; they should have a similar order of magnitude. The difference between 1e-13 and 1 is pretty small, but the two are clearly unbalanced.
- The same goes for the "Theta" and "poly" notation. None of the statements of the paper involving these notations has this feature: the variables epsilon, d, d1 and d2 are fixed and are never quantified with a "forall" quantifier. The fact that a notation is standard does not mean that it cannot be misused.

The authors consider deep learning models with a specific class of activation functions which ensures that the model remains homogeneous: multiplying the weights of a layer by a positive scalar and dividing the weights of another layer by the same amount does not change the prediction of the network.
Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization
Jiang, Liwei, Chen, Yudong, Ding, Lijun
We study the asymmetric matrix factorization problem under a natural nonconvex formulation with arbitrary overparametrization. We consider the model-free setting, with minimal assumptions on the rank or singular values of the observed matrix, in which the global optima provably overfit. We show that vanilla gradient descent with small random initialization sequentially recovers the principal components of the observed matrix. Consequently, when equipped with proper early stopping, gradient descent produces the best low-rank approximation of the observed matrix without explicit regularization. We provide a sharp characterization of the relationship between the approximation error, iteration complexity, initialization size, and stepsize. Our complexity bound is almost dimension-free and depends logarithmically on the approximation error, with significantly more lenient requirements on the stepsize and initialization compared to prior work. Our theoretical results accurately predict the behavior of gradient descent, showing good agreement with numerical experiments.
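The sequential (incremental) recovery of principal components can be reproduced in a few lines. Below is a minimal sketch of vanilla gradient descent on an overparametrized asymmetric factorization with small random initialization; the observed matrix, its spectrum, and all hyperparameters are assumptions made for illustration and are not the paper's experimental setup.

```python
import numpy as np

# Model-free asymmetric factorization: observe a full-rank matrix M with two
# dominant singular values, fit X = F @ G.T with overparametrized factors and
# small random init. GD first approaches the best rank-1 approximation, then
# the best rank-2 approximation, and only much later overfits to M itself,
# so early stopping returns a good low-rank approximation.

rng = np.random.default_rng(2)
d1, d2, k = 30, 20, 20

U, _ = np.linalg.qr(rng.normal(size=(d1, d1)))
V, _ = np.linalg.qr(rng.normal(size=(d2, d2)))
s = np.concatenate(([10.0, 5.0], 0.2 * np.ones(d2 - 2)))    # spiked spectrum
M = (U[:, :d2] * s) @ V.T

def best_rank(M, r):
    u, sv, vt = np.linalg.svd(M, full_matrices=False)
    return (u[:, :r] * sv[:r]) @ vt[:r]

alpha, lr = 1e-4, 5e-3
F = alpha * rng.normal(size=(d1, k))
G = alpha * rng.normal(size=(d2, k))

for t in range(1, 20001):
    R = F @ G.T - M                                          # grad of 0.5*||F G^T - M||_F^2
    F, G = F - lr * R @ G, G - lr * R.T @ F
    if t in (250, 500, 1000, 4000, 20000):
        X = F @ G.T
        errs = [np.linalg.norm(X - best_rank(M, r)) for r in (1, 2)] + [np.linalg.norm(X - M)]
        print(t, [round(e, 2) for e in errs])                # distance to rank-1, rank-2, and M
```

Early checkpoints sit close to the rank-1 and rank-2 approximations before the small singular values are eventually fit, which is exactly the early-stopping effect the abstract describes.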
Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization
Weylandt, Michael, Nagorski, John, Allen, Genevera I.
Convex clustering is a promising new approach to the classical problem of clustering, combining strong performance in empirical studies with rigorous theoretical foundations. Despite these advantages, convex clustering has not been widely adopted, due to its computationally intensive nature and its lack of compelling visualizations. To address these impediments, we introduce Algorithmic Regularization, an innovative technique for obtaining high-quality estimates of regularization paths using an iterative one-step approximation scheme. We justify our approach with a novel theoretical result, guaranteeing global convergence of the approximate path to the exact solution under easily-checked non-data-dependent assumptions. The application of algorithmic regularization to convex clustering yields the Convex Clustering via Algorithmic Regularization Paths (CARP) algorithm for computing the clustering solution path. On example data sets from genomics and text analysis, CARP delivers over a 100-fold speed-up over existing methods, while attaining a finer approximation grid than standard methods. Furthermore, CARP enables improved visualization of clustering solutions: the fine solution grid returned by CARP can be used to construct a convex clustering-based dendrogram, as well as forming the basis of a dynamic path-wise visualization based on modern web technologies. Our methods are implemented in the open-source R package clustRviz, available at https://github.com/DataSlingers/clustRviz.
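The core idea, sweeping a fine grid of regularization parameters while taking a single optimization step per grid point with warm starts, can be illustrated outside the convex clustering setting. The sketch below applies a one-step-per-grid-point scheme to a lasso path using proximal gradient steps; it is not the CARP/ADMM algorithm for convex clustering, and the problem sizes, grid, and step size are illustrative assumptions.

```python
import numpy as np

# One-step-per-grid-point approximation of a regularization path, shown on the
# lasso: sweep a fine lambda grid from large to small and take a single
# proximal-gradient step at each grid point, warm-starting from the previous
# iterate, instead of solving each penalized problem to convergence.

rng = np.random.default_rng(3)
n, p = 100, 20
A = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = A @ beta_true + 0.1 * rng.normal(size=n)

L = np.linalg.norm(A, 2) ** 2 / n                 # smoothness constant of the squared loss
step = 1.0 / L

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam_max = np.max(np.abs(A.T @ y)) / n             # above this penalty the lasso solution is 0
lambdas = lam_max * np.geomspace(1.0, 1e-3, 500)  # fine grid, largest penalty first
beta = np.zeros(p)
support_sizes = []
for lam in lambdas:
    grad = A.T @ (A @ beta - y) / n
    beta = soft_threshold(beta - step * grad, step * lam)   # one prox-gradient step per grid point
    support_sizes.append(int(np.count_nonzero(beta)))

# The approximate path traces the usual staircase of growing active sets.
print("support sizes along the path:", sorted(set(support_sizes)))
```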